Kolmogorov Complexity and Model Selection
Author
Abstract
The goal of statistics is to provide explanations (models) of observed data. We are given some data and have to infer a plausible probabilistic hypothesis explaining it. Consider, for example, the following scenario. We are given a “black box”. We have turned the box on (only once) and it has produced a sequence x of a million bits. Given x, we have to infer a hypothesis about the black box. Classical mathematical statistics does not study this question: it considers only the case when we are given the results of many independent tests of the box. However, in real life there are experiments that cannot be repeated. In some such cases common sense does provide a reasonable explanation of x. Here are three examples:

(1) The black box has printed a million zeros. In this case we would probably say that the box is able to produce only zeros.
(2) The box has produced a sequence without any regularities. In this case we would say that the box produces a million independent random bits.
(3) The first half of the sequence consists of zeros and the second half has no regularities. In this case we would say that the box produces 500000 zeros and then 500000 independent random bits.

Let us try to understand the mechanism of such common sense reasoning. First, we can observe that in each of the three cases we have inferred a finite set A containing x. In the first case, A consists of x only. In the second case, A consists of all sequences of length one million. In the third case, A consists of all sequences of length one million whose first half consists of zeros only. Second, in all three cases the set A can be described in a small number of bits; that is, A has low Kolmogorov complexity. Third, all regularities present in x are shared by all other elements of A; that is, x is a “typical element” of A. It seems that common sense reasoning works as follows: given a string x of n bits, we find a finite set A of strings of length n containing x such that (1) A has low Kolmogorov complexity (we are interested in simple explanations) and (2) x is a typical element of A (A accounts for all the regularities present in x).
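To make the closing recipe a little more concrete, the sketch below restates it in the standard structure-function notation; the deficiency symbol δ and the conditional complexity K(x | A) are a gloss of ours, not notation taken from the abstract.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Typicality of an n-bit string x in a finite set A containing x is usually
% measured by the randomness deficiency
\[
  \delta(x \mid A) \;=\; \log_2 |A| \;-\; K(x \mid A),
\]
% which is small exactly when x has no regularity beyond membership in A.
% The informal recipe then reads: choose a set A containing x that minimizes
% K(A) subject to \delta(x \mid A) being small; equivalently (up to additive
% logarithmic terms), the two-part description length K(A) + \log_2 |A|
% should be close to K(x). For the three examples, with n = 10^6:
\[
  \begin{array}{lll}
    \text{(1) } A = \{0^n\}:                                 & K(A) = O(\log n), & \log_2 |A| = 0,  \\
    \text{(2) } A = \{0,1\}^n:                               & K(A) = O(\log n), & \log_2 |A| = n,  \\
    \text{(3) } A = \{\,0^{n/2} y : y \in \{0,1\}^{n/2}\,\}: & K(A) = O(\log n), & \log_2 |A| = n/2.
  \end{array}
\]
\end{document}
```

In each case the two-part length, O(log n) plus the log-cardinality of the set, matches the intuitive amount of information in the corresponding string, which is why these sets feel like the right explanations.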
Similar resources
A MULTI-OBJECTIVE OPTIMIZATION MODEL FOR PROJECT PORTFOLIO SELECTION CONSIDERING AGGREGATE COMPLEXITY: A CASE STUDY
Existing project selection models do not consider the complexity of projects as a selection criterion, although their complexity may prolong the project duration and even result in failure. In addition, existing models cannot formulate the aggregate complexity of the selected projects. The aggregate complexity is not always equal to the sum of the complexities of the individual projects because of possible syne...
Minimum complexity density estimation
The minimum complexity or minimum description-length criterion developed by Kolmogorov, Rissanen, Wallace, Sorkin, and others leads to consistent probability density estimators. These density estimators are defined to achieve the best compromise between likelihood and simplicity. A related issue is the compromise between accuracy of approximations and complexity relative to the sample size. An i...
Few paths, fewer words: model selection with automatic structure functions
We consider the problem of finding an optimal statistical model for a given binary string. Following Kolmogorov, we use structure functions. In order to get concrete results, we replace Turing machines by finite automata and Kolmogorov complexity by Shallit and Wang’s automatic complexity. The p-value of a model for given data x is the probability that there exists a model with as few states, a...
Inference by Conversion
We are discussing a modeling technique based on the idea of generating data sequences with a number of suggested models. These sequences are transformed, or converted, into an observed data sequence by a suitable function or program. The motivation for doing so arises in cases where the likelihood of the observed data is hard to compute, which is circumvented with an indirect approximation by trying t...
Model Selection for Nested Model Classes with Cost Constraints
For model selection with nested classes, we propose to minimize Rissanen's stochastic complexity with a constraint on expected computational cost. The proposed solution uses the Wald statistic and dynamic programming for order selection, such that maximum likelihood estimates of only a small subset of the hypothesized models need to be computed. Simulation results are presented to compare computat...